In this exercise, we analyse listings from Airbnb bookings (via insideairbnb.com) to better understand price patterns within a particular city. The city we examined was Cape Town and we tried to predict the total cost of four nights for two people To kick things off, we performed EDA to examine the data in more detail. Next, we subset the data and performed feature engineering to create new features which may suitable for further modeling. From this exercise, we move to regression analysis where we iteratively test out a few indicators to see if they are significant in explaining the variability within the dataset. We build up on this to create a final model of 7 covariates (property type, room type, review ratings scores, cancellation policy, how many people the property accommodates, which neighbourhood the property is in and if the property has air conditioning installed. This model achieves a R squared of 0.512. Finally, we use this model to make a prediction on unseen problem.
This data set contains information on publicly available Airbnb listings in the city of Cape Town, South Africa. It was retreived from insideairbnb.com Inside Airbnb is an independent, non-commercial website that offers a set of tools and data that allows you to explore how Airbnb is being used in cities around the world. The original data was scraped from the the Airbnb website (www.airbnb.com). The raw scraping data was then analyzed, cleaned and aggregated where appropriate to faciliate the public discussion and made available here: http://data.insideairbnb.com/south-africa/wc/cape-town/2020-06-21/data/listings.csv.gz
First, we will take a look at our data.
## Rows: 24,062
## Columns: 106
## $ id <dbl> 3191, 15007, 15068, 1507…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped <date> 2020-06-23, 2020-06-24,…
## $ name <chr> "Malleson Garden Cottage…
## $ summary <chr> "This is a lovely, separ…
## $ space <chr> "Beautiful self catering…
## $ description <chr> "This is a lovely, separ…
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> "Mowbray is on the South…
## $ notes <chr> "There are pets on the p…
## $ transit <chr> "The train station is ab…
## $ access <chr> "Fully self-contained co…
## $ interaction <chr> "I live in the main hous…
## $ house_rules <chr> "There is no smoking in …
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 3754, 59072, 59318, 5934…
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "Brigitte", "Dirk", "Lin…
## $ host_since <date> 2008-10-21, 2009-12-01,…
## $ host_location <chr> "Cape Town, Western Cape…
## $ host_about <chr> "I'm single and love to …
## $ host_response_time <chr> NA, "within a day", NA, …
## $ host_response_rate <chr> NA, "100%", NA, "100%", …
## $ host_acceptance_rate <chr> "100%", "100%", "33%", "…
## $ host_is_superhost <lgl> TRUE, TRUE, FALSE, FALSE…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_listings_count <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_total_listings_count <dbl> 1, 11, 13, 5, 1, 1, 3, 1…
## $ host_verifications <chr> "['email', 'phone', 'rev…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ street <chr> "Southern Suburbs, Weste…
## $ neighbourhood <lgl> NA, NA, NA, NA, NA, NA, …
## $ neighbourhood_cleansed <chr> "Ward 57", "Ward 23", "W…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, …
## $ city <chr> "Southern Suburbs", "Cap…
## $ state <chr> "Western Cape", "Western…
## $ zipcode <dbl> 7700, 7441, 7441, 7441, …
## $ market <chr> "Cape Town", "Cape Town"…
## $ smart_location <chr> "Southern Suburbs, South…
## $ country_code <chr> "ZA", "ZA", "ZA", "ZA", …
## $ country <chr> "South Africa", "South A…
## $ latitude <dbl> -33.9, -33.8, -33.8, -33…
## $ longitude <dbl> 18.5, 18.5, 18.5, 18.5, …
## $ is_location_exact <lgl> TRUE, TRUE, FALSE, TRUE,…
## $ property_type <chr> "Guesthouse", "House", "…
## $ room_type <chr> "Entire home/apt", "Enti…
## $ accommodates <dbl> 2, 6, 6, 2, 2, 2, 4, 3, …
## $ bathrooms <dbl> 1.0, 3.0, 2.0, 1.5, 1.0,…
## $ bedrooms <dbl> 1, 3, 3, 1, 1, 1, 2, 1, …
## $ beds <dbl> 1, 4, 5, 2, 1, 1, 4, 1, …
## $ bed_type <chr> "Real Bed", "Real Bed", …
## $ amenities <chr> "{TV,\"Cable TV\",Wifi,\…
## $ square_feet <dbl> NA, NA, NA, NA, NA, NA, …
## $ price <chr> "$815.00", "$2,038.00", …
## $ weekly_price <chr> "$5,204.00", "$9,000.00"…
## $ monthly_price <chr> "$20,816.00", "$40,000.0…
## $ security_deposit <chr> NA, "$2,500.00", "$3,000…
## $ cleaning_fee <chr> "$260.00", "$850.00", "$…
## $ guests_included <dbl> 2, 6, 1, 2, 2, 3, 2, 1, …
## $ extra_people <chr> "$0.00", "$250.00", "$0.…
## $ minimum_nights <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_nights <dbl> 730, 120, 730, 1125, 730…
## $ minimum_minimum_nights <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ maximum_minimum_nights <dbl> 3, 2, 4, 5, 2, 14, 2, 2,…
## $ minimum_maximum_nights <dbl> 730, 120, 730, 1125, 730…
## $ maximum_maximum_nights <dbl> 730, 120, 730, 1125, 730…
## $ minimum_nights_avg_ntm <dbl> 3.0, 2.0, 4.0, 5.0, 2.0,…
## $ maximum_nights_avg_ntm <dbl> 730, 120, 730, 1125, 730…
## $ calendar_updated <chr> "7 months ago", "4 month…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 30, 0, 30, 0, 30, 30, 0,…
## $ availability_60 <dbl> 60, 0, 60, 0, 60, 60, 0,…
## $ availability_90 <dbl> 90, 0, 90, 0, 90, 90, 0,…
## $ availability_365 <dbl> 365, 0, 365, 234, 365, 3…
## $ calendar_last_scraped <date> 2020-06-23, 2020-06-24,…
## $ number_of_reviews <dbl> 54, 27, 0, 5, 0, 2, 8, 3…
## $ number_of_reviews_ltm <dbl> 6, 1, 0, 0, 0, 0, 1, 0, …
## $ first_review <date> 2013-05-31, 2013-12-15,…
## $ last_review <date> 2020-03-07, 2019-08-18,…
## $ review_scores_rating <dbl> 97, 97, NA, 100, NA, 90,…
## $ review_scores_accuracy <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_cleanliness <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ review_scores_checkin <dbl> 10, 10, NA, 10, NA, 9, 1…
## $ review_scores_communication <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_location <dbl> 10, 10, NA, 10, NA, 10, …
## $ review_scores_value <dbl> 10, 10, NA, 10, NA, 8, 1…
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <dbl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable <lgl> TRUE, FALSE, FALSE, FALS…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "moderate", "strict_14_w…
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count <dbl> 1, 4, 10, 5, 1, 1, 3, 1,…
## $ calculated_host_listings_count_entire_homes <dbl> 1, 4, 10, 1, 1, 1, 2, 1,…
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 0, 4, 0, 0, 0, 0, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> 0.63, 0.34, NA, 0.06, NA…
In the original dataset, there are 24062 observations and 106 variables. There are 41 numerical variables and 43 categorical/factor variables as seen in the table above.
In order to fit our model, we will remove columns that are redundant, of poor data quality, or that cannot be used to explore our dependent variable, price for four nights for 2 people.
#Vector of all columns we want to exclude
columns_deselect <- c(3:19, 20:22, 24:25, 27:28, 30:36, 38:39, 41:48, 58, 60, 62:63, 69:82, 84, 94:95, 96, 98, 100:105)
#Deselecting all columns we dont need
listings <- data %>%
select(-columns_deselect)In the table below, we outline our reasoning for removing some variables.
## Rows: 70
## Columns: 2
## $ `Variable Name` <chr> "scrape_id", "last_scraped", "name", "summary", "spac…
## $ Reason <chr> "The scrape ID is the same for every entry.", "Date o…
kable(deselected_variables[1:2]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, font_size = 14, position = "center") %>%
scroll_box(width = "910px", height = "400px")| Variable Name | Reason |
|---|---|
| scrape_id | The scrape ID is the same for every entry. |
| last_scraped | Date of scraping not relevant for further data analysis. |
| name | All of the variables are different and only text data. |
| summary | Host description for their listing in text form. Would be difficult to extract information. |
| space | Host description for their listing in text form. Would be difficult to extract information. |
| description | Host description for their listing in text form. Would be difficult to extract information. |
| experiences_offered | All values are “none”. |
| neighborhood_overview | Host description for neighborhood in text form. Would be difficult to extract information. |
| notes | Additional information provided by host in text form. Would be difficult to extract information. |
| transit | Host description for transit in text form. Would be difficult to extract information. |
| access | Host description for listing access in text form. Would be difficult to extract information. |
| interaction | Host description for interaction with guests in text form. Would be difficult to extract information. |
| house_rules | Host description for house rules in text form. Would be difficult to extract information. |
| thumbnail_url | Url not relevant for further data analysis. |
| medium_url | Url not relevant for further data analysis. |
| picture_url | Url not relevant for further data analysis. |
| xl_picture_url | Url not relevant for further data analysis. |
| host_id | All are unique values, therefore is redundant since we are keeping listing ID |
| host_url | Url not relevant for further data analysis. |
| host_name | Host name is redundant with listing ID |
| host_location | Variable is inconsistently formatted and not relevant for further data analysis. |
| host_about | Host introduction in text form. Would be difficult to extract information. |
| host_response_rate | Most of the variables are N/A or 100%, therefore would not be useful in explaining our dependent variable. |
| host_acceptance_rate | The distribution is very heavily left skewed and contains many N/A values. |
| host_thumbnail_url | Url not relevant for further data analysis. |
| host_picture_url | Url not relevant for further data analysis. |
| host_neighbourhood | Only contains N/A values. |
| host_listings_count | Captured in calculated_host_listings_count, thus is redundant. |
| host_total_listings_count | Captured in calculated_host_listings_count, thus is redundant. |
| host_verifications | Type of verififcation irrelevant for analysis. |
| host_has_profile_pic | Out of 17,300 ppl, only 50 have no picutre. |
| street | Street name is too granular, we will use neighborhood_cleansed to account for location in our analysis. |
| neighbourhood | Only contains N/A values. |
| neighbourhood_group_cleansed | Only contains N/A values. |
| city | Data quality is poor, as there are typos and the majority of entries are in Cape Town. |
| state | Data quality is poor, as there are typos and different formats. The majority of entries are in the same state. |
| zipcode | Zip code is too granular, we will use neighborhood_cleansed to account for location in our analysis. |
| market | Market is almost always Cape Town. |
| smart_location | Data quality is poor, with cities spelled in different ways. |
| country_code | All listings are in South Africa with the exception of one listing in Congo. This is obviously a mistake, as other data like latitude or longitude clearly shows the listing is in Cape Town. Hence we can delete the column. |
| country | All listings are in ZA with the exception of one listing in CG. This is obviously a mistake, as other data like latitude or longitude clearly shows the listing is in Cape Town. Hence we can delete the column. |
| bed_type | The vast majority of listings (24,029) only contain one type of bed: “Real Bed” |
| square_feet | Data is incomplete. 99.8% are missing values. |
| weekly_price | Since we are only looking at rentals for <=4 days, the weekly price is irrelevant. |
| monthly_price | Since we are only looking at rentals for <= 4 days, the monthly price is irrelevant. |
| maximum_nights | We are looking at listings where minimum nights is <= 4, thus maximum nights is irrelevant. |
| minimum_minimum_nights | This is a forward-looking variable for the smallest minimum nights available looking 1 year out. This is irrelevant for our analysis. |
| maximum_minimum_nights | This is a forward-looking variable for the largest minimum nights available looking 1 year out. This is irrelevant for our analysis. |
| minimum_maximum_nights | This is a forward-looking variable for the smallest maximum nights available looking 1 year out. This is irrelevant for our analysis. |
| maximum_maximum_nights | This is a forward-looking variable for the largest maximum nights available looking 1 year out. This is irrelevant for our analysis. |
| minimum_nights_avg_ntm | This is a forward-looking variable for the average minimum nights available looking 1 year out. This is irrelevant for our analysis. |
| maximum_nights_avg_ntm | This is a forward-looking variable for the average maximum nights available looking 1 year out. This is irrelevant for our analysis. |
| calendar_updated | Irrelevant for future analysis as it only indicates when the listing calendar was last updated. |
| has_availability | Only has one unique value: TRUE |
| availability_30 | This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis. |
| availability_60 | This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis. |
| availability_90 | This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis. |
| availability_365 | This is a forward-looking variable for the availability of the listing x days into the future. This is irrelevant for our analysis. |
| calendar_last_scraped | Irrelevant for future analysis as it only indicates when the calendar was last scraped. |
| number_of_reviews_ltm | Represents the number of reviews in the last twelve months. It is highly correlated with number_of_reviews. We decided to keep number_of_reviews because that is how the user interface shows it on the Airbnb website. |
| requires_license | Contains only N/A values. |
| license | 99.9% incomplete. |
| jurisdiction_names | Contains only N/A values. |
| is_business_travel_ready | Only has one unique value: FALSE |
| require_guest_profile_picture | 99.9% are FALSE. |
| require_guest_phone_verification | Around 99% are FALSE. |
| calculated_host_listings_count | Most of the entries are 1 with some extreme outliers. |
| calculated_host_listings_count_entire_homes | Most of the entries are 1 with some extreme outliers. |
| calculated_host_listings_count_private_rooms | Most of the entries are 1 with some extreme outliers. |
| calculated_host_listings_count_shared_rooms | Most of the entries are 1 with some extreme outliers. |
In order to run models with our data, we need to modify some variables and create some new variables that we want to examine.
First, we must change the price variables (price, cleaning_fee, extra_people, and security_deposit) to numeric variables, as they are now characters.
# Changing price variables to numeric
listings <- listings %>%
mutate(price = parse_number(price),
cleaning_fee = parse_number(cleaning_fee),
extra_people = parse_number(extra_people),
security_deposit = parse_number(security_deposit))
# Check if these variables are numeric, whether we have any zero prices for `price` or there are any NAs
listings %>%
select(price, cleaning_fee, extra_people) %>%
skim()%>%
kable()%>%
kable_styling()| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | price | 0 | 1.000 | 2470.0 | 7527 | 0 | 659 | 1058 | 2003.0 | 300009 | ▇▁▁▁▁ |
| numeric | cleaning_fee | 7049 | 0.707 | 338.4 | 512 | 0 | 100 | 300 | 450.0 | 40000 | ▇▁▁▁▁ |
| numeric | extra_people | 0 | 1.000 | 83.8 | 220 | 0 | 0 | 0 | 96.5 | 4681 | ▇▁▁▁▁ |
We will also change host_response_time to a factor variable, that is ordered based on how long a host takes to respond. That way we can determine whether there’s a relationship between response time length and price.
# Changing host_response_time to a factor variable
listings <- listings%>%
mutate(host_response_time = factor(host_response_time,
order = TRUE, levels = c("within an hour", "within a few hours", "within a day",
"a few days or more")))We will also turn the listing ID’s into character strings rather than numeric values.
#Turn IDs into string
listings <- listings %>%
mutate(id = as.character(id),
host_id = as.character(id))We create a new variable called host_experience_y which tells us the number of years a host is active on Airbnb. We want to examine how long a host has been active impacts how much they might charge for a listing.
#max_date in host_since = 2020-06-17
# Creating a new variable call host_experience_y which tells us about the no of years the host is active on Airbnb
listings <- listings %>%
mutate(host_experience_y = as.numeric(ymd("2020-06-17") - host_since)/365)For security_deposit, since there are 8559 missing values, we will treat the missing values as having no security deposit. We will also convert to a dummy variable where if the listing has a security deposit the variable will be TRUE, an if not, FALSE. We do this because there is a massive range of security deposits and the distribution is very skewed. This will help us determine whether having a security deposit influences the price.
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | security_deposit | 8559 | 0.644 | 2770 | 6319 | 0 | 0 | 1478 | 3000 | 150000 | ▇▁▁▁▁ |
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1478 | 3000 | 150000 | 2770 | 6319 | 15503 | 8559 |
#turning security_deposit into a dummy variable
listings <- listings %>%
mutate(security_deposit = case_when(
is.na(security_deposit) ~ 0,
TRUE ~ security_deposit
)) %>%
mutate(security_deposit=case_when(
security_deposit==0~FALSE,
TRUE~TRUE
))Next, we look at cancellation_policy. It looks like there are several types of cancellation policies that only have one or 2 listings with that category based on the count below. Additionally, to simplify, we can group all of the policies into three main categories: flexible, moderate, and strict.
| cancellation_policy | n |
|---|---|
| flexible | 7822 |
| luxury_moderate | 59 |
| luxury_super_strict_125 | 1 |
| luxury_super_strict_95 | 9 |
| moderate | 5620 |
| strict | 2 |
| strict_14_with_grace_period | 10101 |
| super_strict_30 | 397 |
| super_strict_60 | 51 |
#We first convert cancellation policy into a 3-category variable where cancellation policy can be flexible, moderate, and strict.
listings<- listings %>%
mutate(
cancellation_policy = case_when (cancellation_policy %in% c("flexible") ~"flexible", cancellation_policy %in% c("luxury_moderate", "moderate") ~"moderate",
cancellation_policy %in%c("luxury_super_strict_125","luxury_super_strict_95",
"strict","strict_14_with_grace_period","super_strict_30",
"super_strict_60")~"strict"),
)We want to examine whether having air conditioning affects the listing price, as we noticed that only around 30% of properties have air conditioning installed. We introduce a new variable that evaluates to TRUE if string ‘Air conditioning’ is included in amenities, and FALSE otherwise.
#creating the A/C variable
listings <- listings %>%
mutate(ac=str_detect(amenities,'Air conditioning'))
proportions<-listings %>%
group_by(ac) %>%
summarise(count=n()) %>%
mutate(proportion=count/sum(count))
proportions%>%
kable()%>%
kable_styling()| ac | count | proportion |
|---|---|---|
| FALSE | 17127 | 0.712 |
| TRUE | 6935 | 0.288 |
It seems that having Air conditioning in Cape Town is some sort of a luxury. Furthermore, according to website Statista, only 6% of households in South Africa have air conditioning (Global AC penetration rate by country 2016 | Statista, 2020).
We will also convert the amenities variable to be a count of the number of amenities in a certain property, rather than comma separated values.
Additionally, we simplify the neighbourhood_cleansed by condensing them into the new variable neighborhood_simplified. Cape Town is separated into over 100 Wards, which would be difficult to include in our model. Through our research, we found that Cape Town is also segmented into 24 sub-councils. We assigned each ward to one of the 24 sub-councils following official data from capetown.gov.za. Then, based on an interview with a Cape Town local, we assigned each sub-council a grade corresponding to cost of living and quality of life, with the levels: best, better, good, and standard.
# Cape Town is made up of 100+ diffeerent Wards.
# The column neighborhood_cleansed contains information on which Ward the listing is located in
# We first assign each Ward to 1 of the 24 sub-councils that make up Cape Town, following official data from the Cape Town website
# (http://resource.capetown.gov.za/documentcentre/Documents/Maps%20and%20statistics/Population_and_Households_by_Ward_2001_and_2011.pdf)
#Vector for each sub district
SC01 <-c("Ward 23", "Ward 29", "Ward 32", "Ward 104")
SC02 <-c("Ward 6", "Ward 7", "Ward 8", "Ward 101", "Ward 102", "Ward 111")
SC03 <-c("Ward 1", "Ward 4", "Ward 5", "Ward 70", "Ward 107", "Ward 113")
SC04 <-c("Ward 25", "Ward 26", "Ward 27", "Ward 28", "Ward 30")
SC05 <-c("Ward 13", "Ward 20", "Ward 24", "Ward 31", "Ward 50", "Ward 106")
SC06 <-c("Ward 2", "Ward 3", "Ward 9", "Ward 10", "Ward 12", "Ward 22")
SC07 <-c("Ward 21", "Ward 103", "Ward 105", "Ward 112")
SC08 <-c("Ward 83", "Ward 85", "Ward 86", "Ward 100")
SC09 <-c("Ward 18", "Ward 87", "Ward 89", "Ward 90", "Ward 91", "Ward 116")
SC10 <-c("Ward 92", "Ward 93", "Ward 94", "Ward 97", "Ward 98", "Ward 99")
SC11 <-c("Ward 40", "Ward 44", "Ward 46", "Ward 47")
SC12 <-c("Ward 78", "Ward 79", "Ward 81", "Ward 82")
SC13 <-c("Ward 34", "Ward 35", "Ward 36", "Ward 80", "Ward 88")
SC14 <-c("Ward 37", "Ward 38", "Ward 39", "Ward 41", "Ward 42", "Ward 45")
SC15 <-c("Ward 51", "Ward 52", "Ward 53", "Ward 55", "Ward 56")
SC16 <-c("Ward 54", "Ward 57", "Ward 74", "Ward 77", "Ward 115")
SC17 <-c("Ward 48", "Ward 49", "Ward 60")
SC18 <-c("Ward 63", "Ward 65", "Ward 66", "Ward 67", "Ward 68", "Ward 110")
SC19 <-c("Ward 61", "Ward 64", "Ward 69")
SC20 <-c("Ward 58", "Ward 59", "Ward 62", "Ward 71", "Ward 72", "Ward 73")
SC21 <-c("Ward 11", "Ward 19", "Ward 108")
SC22 <-c("Ward 14", "Ward 16", "Ward 17", "Ward 114")
SC23 <-c("Ward 33", "Ward 43", "Ward 75", "Ward 76")
SC24 <-c("Ward 15", "Ward 84", "Ward 95", "Ward 96", "Ward 109")# Based on an interview with a local, we then assigned the 24 sub-council a grade corresponding to quality of life
# vector for living quality
Best <-c(SC16, SC19, SC20)
Better <-c(SC01, SC08, SC15, SC17)
Good <-c(SC02, SC03, SC04, SC05, SC06, SC07, SC08, SC15, SC21, SC22, SC23, SC24)
Standard <-c(SC09, SC10, SC11, SC12, SC13, SC14, SC17, SC18)# Next, we create two new variables: "sub_council", containing the sub council for each listing and "neighborhood_simplified", classing the location of the listing to either standard, good, better or best.
#assigning sub council
listings <- listings %>%
mutate(sub_council = case_when(neighbourhood_cleansed %in% SC01 ~ "SC01",
neighbourhood_cleansed %in% SC02 ~ "SC02",
neighbourhood_cleansed %in% SC03 ~ "SC03",
neighbourhood_cleansed %in% SC04 ~ "SC04",
neighbourhood_cleansed %in% SC05 ~ "SC05",
neighbourhood_cleansed %in% SC06 ~ "SC06",
neighbourhood_cleansed %in% SC07 ~ "SC07",
neighbourhood_cleansed %in% SC08 ~ "SC08",
neighbourhood_cleansed %in% SC09 ~ "SC09",
neighbourhood_cleansed %in% SC10 ~ "SC10",
neighbourhood_cleansed %in% SC11 ~ "SC11",
neighbourhood_cleansed %in% SC12 ~ "SC12",
neighbourhood_cleansed %in% SC13 ~ "SC13",
neighbourhood_cleansed %in% SC14 ~ "SC14",
neighbourhood_cleansed %in% SC15 ~ "SC15",
neighbourhood_cleansed %in% SC16 ~ "SC16",
neighbourhood_cleansed %in% SC17 ~ "SC17",
neighbourhood_cleansed %in% SC18 ~ "SC18",
neighbourhood_cleansed %in% SC19 ~ "SC19",
neighbourhood_cleansed %in% SC20 ~ "SC20",
neighbourhood_cleansed %in% SC21 ~ "SC21",
neighbourhood_cleansed %in% SC22 ~ "SC22",
neighbourhood_cleansed %in% SC23 ~ "SC23",
neighbourhood_cleansed %in% SC24 ~ "SC24"
))# Assigning neighbourhood simplified
listings <- listings %>%
mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% Best ~ "best",
neighbourhood_cleansed %in% Better ~ "better",
neighbourhood_cleansed %in% Good ~ "good",
neighbourhood_cleansed %in% Standard ~ "standard"))Now we will take a look at the minimum_nights variable and filter only to include listings where the minimum number of nights is less than or equal to 4.
#To check the 5 most common values for the variable `minimum_nights`
listings%>%
count(minimum_nights)%>%
arrange(desc(n))%>%
kable()%>%
kable_styling()| minimum_nights | n |
|---|---|
| 2 | 7436 |
| 1 | 6894 |
| 3 | 3527 |
| 7 | 1614 |
| 5 | 1489 |
| 4 | 1098 |
| 10 | 508 |
| 14 | 337 |
| 20 | 252 |
| 6 | 244 |
| 30 | 182 |
| 8 | 62 |
| 90 | 59 |
| 28 | 50 |
| 12 | 47 |
| 15 | 36 |
| 21 | 35 |
| 60 | 33 |
| 180 | 27 |
| 9 | 17 |
| 13 | 12 |
| 31 | 11 |
| 25 | 9 |
| 100 | 8 |
| 120 | 7 |
| 29 | 6 |
| 365 | 6 |
| 11 | 5 |
| 18 | 3 |
| 27 | 3 |
| 45 | 3 |
| 93 | 3 |
| 150 | 3 |
| 360 | 3 |
| 16 | 2 |
| 19 | 2 |
| 22 | 2 |
| 24 | 2 |
| 32 | 2 |
| 95 | 2 |
| 168 | 2 |
| 183 | 2 |
| 185 | 2 |
| 210 | 2 |
| 26 | 1 |
| 33 | 1 |
| 40 | 1 |
| 42 | 1 |
| 89 | 1 |
| 91 | 1 |
| 94 | 1 |
| 124 | 1 |
| 182 | 1 |
| 186 | 1 |
| 240 | 1 |
| 300 | 1 |
| 1125 | 1 |
The top 5 most common values are for minimum nights are: 2, 1, 3, 7 and 5. The top values seem to be a week or less, which means most listings are probably used for tourism. However, some of the less common values also offer minimum stays of 30 or 365 days which demonstrates that some listings can be monthly or yearly rentals.
Finally we will inspect the accommodates variable. We want to filter out the 737 listings that can only accommodate one person, since we are interested in places where two people could rent.
#To check the 5 most common values for the variable accommodates
listings %>%
count(accommodates) %>%
arrange(desc(accommodates))%>%
kable()%>%
kable_styling()| accommodates | n |
|---|---|
| 40 | 1 |
| 20 | 1 |
| 18 | 1 |
| 16 | 81 |
| 15 | 5 |
| 14 | 50 |
| 13 | 13 |
| 12 | 130 |
| 11 | 13 |
| 10 | 294 |
| 9 | 61 |
| 8 | 712 |
| 7 | 214 |
| 6 | 1785 |
| 5 | 617 |
| 4 | 4466 |
| 3 | 1069 |
| 2 | 8705 |
| 1 | 737 |
Some of our variables have missing values. Cleaning fee, for example, which we will use to calculate price for four nights, has 7049 missing values, as seen below. This is most likely because these listings do not charge a cleaning fee, so we will treat N/A values as 0.
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | cleaning_fee | 4894 | 0.731 | 303 | 368 | 0 | 80 | 250 | 450 | 17347 | ▇▁▁▁▁ |
Now, we will look at some of the variables that we want to examine when we eventually build our model. We want to look at how the variables are distributed, and whether there is any correlation between explanatory variables.
First, we will examine the host_experience_y variable that we created, which demonstrates how long a host has been listed on Airbnb.
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.53 | 4.16 | 5.34 | 11.7 | 4.04 | 2.06 | 16562 | 2 |
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | data | 2 | 1 | 4.04 | 2.06 | 0 | 2.53 | 4.16 | 5.34 | 11.7 | ▅▇▆▂▁ |
ggplot(listings, aes(x=host_experience_y)) +
geom_density(fill = "lavender") +
labs(x = "Host Experience", y="Density", title= "Host Experience Density Plot") +
theme_minimal() Based on the density plot, it looks like host_experience_y is a multi-modal distribution which is slightly right skewed. The mean is 4.13 years and the median is 4.27.
Next, we will look at host_is_superhost. We might want to use this in our model to examine whether superhosts tend to charge a premium.
| skim_type | skim_variable | n_missing | complete_rate | logical.mean | logical.count |
|---|---|---|---|---|---|
| logical | data | 2 | 1 | 0.25 | FAL: 12417, TRU: 4145 |
| host_is_superhost | n |
|---|---|
| FALSE | 12417 |
| TRUE | 4145 |
| NA | 2 |
ggplot(listings, aes(x = host_is_superhost)) +
geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
labs(x = "Super Host", y="Density", title= "Distribution of Superhosts") +
theme_minimal() Based on the data above, roughly 1/4 of all hosts in Cape Town are super hosts.
Now, we look at number_of_reviews, which represents the total number of reviews on a particular listing.
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | number_of_reviews | 0 | 1 | 16.2 | 31 | 0 | 0 | 4 | 17 | 425 | ▇▁▁▁▁ |
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 17 | 425 | 16.2 | 31 | 16564 | 0 |
listings%>%
ggplot(aes(x=number_of_reviews))+
geom_histogram(aes(x = number_of_reviews), color = "blue", fill = "#77a5f0") +
labs(x = "Number of Reviews", y="Count", title= "Total Number of Reviews") +
ggtitle("Total Number of Reviews") +
coord_cartesian(xlim = c(0, 220)) +
theme_minimal()Based on the histogram, number of reviews is very skewed. It seems like the majority of listings do not have many reviews. The median is 3, yet the mean is 14.42, indicating that the variable is right-skewed.
Now we’ll look at the review_scores_rating variable, as we want to examine whether listings with higher reviews charge a premium.
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 93 | 97 | 100 | 100 | 94.5 | 8.93 | 11959 | 4605 |
There are 7314 missing values for review_scores_rating, which most likely means that those listings did not receive a review. The median rating is 97 and the mean rating is 94.58. The standard deviation is 8.87.
Now let’s visualize the distribution:
listings%>%
ggplot(aes(x=review_scores_rating))+
geom_density(fill="lavender") +
labs(x = "Review Score Rating") +
ggtitle("Distribution of Review Score Rating.") +
theme_minimal() The distribution is left-skewed with most of the ratings in the 90’s. There is a small second peak at 80.
The following variables are reviews for individual categories. They are all discrete values from 0-10 and exhibit similar distributions. We will keep them all to explore whether some variables impact price more than others.
accuracy <- listings%>%
ggplot(aes(x=review_scores_accuracy)) +
geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
labs(x = "Review Score Accuracy") +
ggtitle("Accuracy") +
theme_minimal()cleanliness <- listings%>%
ggplot(aes(x=review_scores_cleanliness))+
geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
labs(x = "Review Score Cleanliness") +
ggtitle("Cleanliness") +
theme_minimal()communication <- listings%>%
ggplot(aes(x=review_scores_communication))+
geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
labs(x = "Review Score Communication") +
ggtitle("Communication") +
theme_minimal()check_in <- listings%>%
ggplot(aes(x=review_scores_checkin))+
geom_bar(width = 0.5, color = "blue", fill="#77a5f0") +
labs(x = "Review Score Check-in") +
ggtitle("Check-in") +
theme_minimal()location <- ggplot(listings, aes(x = review_scores_location)) +
geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
labs(x = "Review Score Location") +
ggtitle("Location") +
theme_minimal() +
NULLvalue <- ggplot(listings, aes(x = review_scores_value)) +
geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
labs(x = "Review Score Value") +
ggtitle("Value") +
theme_minimal() +
NULLlibrary(gridExtra)
grid.arrange(accuracy, cleanliness, check_in, communication, location, value, nrow = 3, ncol = 3)We also want to look at whether if a property has an instant booking option it impacts price. It looks like, based on the table below, properties are pretty evenly split between the two options.
| skim_type | skim_variable | n_missing | complete_rate | logical.mean | logical.count |
|---|---|---|---|---|---|
| logical | instant_bookable | 0 | 1 | 0.563 | TRU: 9332, FAL: 7232 |
ggplot(listings, aes(x = instant_bookable)) +
geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
labs(x = "Instant Booking") +
ggtitle("Count of listings with or without Instant Booking") +
theme_minimal() +
NULLWe also want to explore the property_type variable.
#To check the top 4 categories of property type in Cape Town
listings %>%
count(property_type) %>%
arrange(desc(n))%>%
kable()%>%
kable_styling()| property_type | n |
|---|---|
| Apartment | 7256 |
| House | 3638 |
| Guest suite | 1314 |
| Guesthouse | 911 |
| Bed and breakfast | 569 |
| Villa | 484 |
| Serviced apartment | 407 |
| Condominium | 398 |
| Cottage | 350 |
| Townhouse | 318 |
| Loft | 262 |
| Boutique hotel | 158 |
| Bungalow | 90 |
| Other | 76 |
| Hostel | 66 |
| Farm stay | 50 |
| Chalet | 44 |
| Tiny house | 31 |
| Aparthotel | 29 |
| Cabin | 24 |
| Hotel | 24 |
| Nature lodge | 14 |
| Camper/RV | 9 |
| Earth house | 9 |
| Boat | 6 |
| Resort | 5 |
| Dorm | 4 |
| Tipi | 4 |
| Hut | 2 |
| Lighthouse | 2 |
| Treehouse | 2 |
| Barn | 1 |
| Bus | 1 |
| Casa particular (Cuba) | 1 |
| Castle | 1 |
| Dome house | 1 |
| Heritage hotel (India) | 1 |
| Tent | 1 |
| Vacation home | 1 |
As we can see, the four most common listings (Apartment, House, Guest Suite, and Villa) make up ~80% of all listings. Thus, to simplify, we will create a new variable with five categories: apartment, house, guest suite, villa, and other.
#To simplify the `property_type` variable into 5 categories, Mutate others to gather the other property type with small values
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","House","Guest suite","Villa","Guest House","Bed and breakfast") ~ property_type,
TRUE ~ "Other"
))
listings %>%
ggplot(aes(x=prop_type_simplified)) +
labs(x = "Property Type") +
ggtitle("Simplified Property Type") +
geom_bar(width = 0.5, color = "blue", fill = "#6dccc1") +
theme_minimal()Finally, we examine the distribution of listings based on quality of life in the area that they are located.
#count by neighborhood
listings %>%
count(neighbourhood_simplified) %>%
arrange(desc(n))%>%
kable()%>%
kable_styling()| neighbourhood_simplified | n |
|---|---|
| best | 10989 |
| good | 3000 |
| better | 2402 |
| standard | 173 |
ggplot(listings, aes(x = neighbourhood_simplified)) +
geom_bar(width = 0.5, color = "blue", fill = "#77a5f0") +
labs(x = "Neighborhood", y="Count", title="Distribution of listings based on cost of living") +
theme_minimal()The majority of the Airbnb listings seem to be located in Wards that are classified as “best” in regards to location. These are likely to be popular destinations for tourists.
Now we will take a look at some of the relationships between the explanatory variables that we might want to include in our model, to see if anything stands out.
# To see the correlations between variables
listings%>%
select(bedrooms, bathrooms, accommodates, review_scores_rating, host_experience_y, price_4_nights) %>%
ggpairs(alpha=0.4) +
theme_minimal() Based on the correlation matrix above, it looks like the variables bathroom,bedrooms, and accommodates are highly correlated. We most likely will not want to include all of them in our model.
Now, we will look at the relationship between property type and price.
listings %>%
ggplot(aes(x=prop_type_simplified, y= price_4_nights)) +
geom_boxplot(aes(fill=prop_type_simplified), show.legend=FALSE)+
theme_minimal()+
labs(x="Property Type", title="Property Type and Price for Four Nights")+
scale_y_continuous(name="Price for 4 nights", limits = c(0,60000))Based on the plot, it seems like Villa and House might be indicators of a higher price. We will test them when we build our model.
We will also examine neighborhood in the plot below.
listings %>%
ggplot(aes(x=neighbourhood_simplified, y= price_4_nights)) +
geom_boxplot(aes(fill=neighbourhood_simplified), show.legend=FALSE)+
theme_minimal()+
labs(x="Neighbourood", title="Neighbourhood and Price for Four Nights")+
scale_y_continuous(name="Price for 4 nights", limits = c(0,60000)) As expected, it seems like listings in the “best” neighbourhoods are most expensive, while those in “standard” are cheapest. We will test whether this effect is significant when we build our model.
We will also look at host_is_superhost.
listings %>%
ggplot(aes(x=host_is_superhost, y= price_4_nights)) +
geom_boxplot(aes(fill=host_is_superhost, show.legend=FALSE))+
theme_minimal()+
labs(x="Superhost", title="Superhost and Price for Four Nights")+
scale_y_continuous(name="Price for 4 nights", limits = c(0,60000)) There does not seem to be a large distinction between whether a host is a superhost and the price of the listing, but we will confirm when we test our model.
We will also take a look at cancellation_policy.
listings %>%
ggplot(aes(x=cancellation_policy, y= price_4_nights)) +
geom_boxplot(aes(fill=cancellation_policy, show.legend=FALSE))+
theme_minimal()+
labs(x="Cancellation Policy", title="Cancellation Policy and Price for Four Nights")+
scale_y_continuous(name="Price for 4 nights", limits = c(0,60000)) It seems like listings with stricter policies are slightly more expensive, which could be because they are more desirable listings. We will test the significance in our model fitting.
Finally, we will examine the air conditioning variable that we created.
listings %>%
ggplot(aes(x=ac, y= price_4_nights)) +
geom_boxplot(aes(fill=ac, show.legend=FALSE))+
theme_minimal()+
labs(x="Air conditioning", title="Air conditioning and Price for Four Nights")+
scale_y_continuous(name="Price for 4 nights", limits = c(0,60000)) As expected, it seems like listings that have air conditioning are more expensive. We will examine the effect as we construct our model.
Visualizing the map of listings in Cape Town
#density plot price_4_nights
ggplot(data=listings, aes(x=price_4_nights)) +
geom_density(fill = "lavender") +
labs(x="log of Price for Four Nights") +
theme_minimal()#density plot log
ggplot(data=listings, aes(x=price_4_nights)) +
scale_x_log10()+geom_density(fill = "lavender") +
labs(x="log of Price for Four Nights") +
theme_minimal() Price_4_nights is very heavily positively skewed. Log (price_4_nights), on the other hand, is approximately normally distributed. Hence, log (price_4_nights) is a better candidate for further modeling.
Next, we fit a regression model with 3 explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
# ***** Fit linear regression models: First, just the mean
model0 <- lm(log (price_4_nights) ~ 1, data= listings)
model0 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 8.45 | 0.0061 | 1.39e+03 | 0 | 8.44 | 8.47 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.785 | -1.95e+04 | 3.9e+04 | 3.9e+04 | 1.02e+04 | 16563 | 16564 |
# ***** Fit linear regression models: criminals on 3 explanatory variables.
model1 <- lm(log (price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data= listings)
model1 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.824 | 0.065 | 120.148 | 0.000 | 7.696 | 7.951 |
| prop_type_simplifiedBed and breakfast | -0.107 | 0.046 | -2.319 | 0.020 | -0.198 | -0.017 |
| prop_type_simplifiedGuest suite | -0.422 | 0.022 | -18.935 | 0.000 | -0.465 | -0.378 |
| prop_type_simplifiedHouse | 0.130 | 0.016 | 8.051 | 0.000 | 0.098 | 0.161 |
| prop_type_simplifiedOther | 0.000 | 0.017 | -0.028 | 0.978 | -0.033 | 0.033 |
| prop_type_simplifiedVilla | 1.233 | 0.038 | 32.763 | 0.000 | 1.160 | 1.307 |
| number_of_reviews | -0.001 | 0.000 | -6.488 | 0.000 | -0.002 | -0.001 |
| review_scores_rating | 0.006 | 0.001 | 8.972 | 0.000 | 0.005 | 0.008 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.131 | 0.13 | 0.667 | 257 | 0 | 7 | -12118 | 24255 | 24321 | 5314 | 11951 | 11959 |
Below, we interpret the coefficients of our model: review_scores_rating For every 1 unit increase in review_scores_rating, the price_4_nights increases by 0.62% . As (exp(0.006188)-1)100 = 0.62% prop_type_simplified Price for 4 nights at a Bed and Breakfast is on average 10.2% lower than the price for an apartment. (Equivalent to (exp(-0.107467) - 1)100) = (-10.2%) Price for 4 nights at a Guest Suite is on average ** 34.4% lower than the price for an apartment. (Equivalent to (exp(-0.421721) - 1)100) = -34.4%) Price for 4 nights at a House is on average 13.8% higher than the price for an apartment. (Equivalent to (exp(0.129616) - 1)100) = 13.8%) Price for 4 nights at a Villa is on average 243.3% higher than the price for an apartment. (Equivalent to (exp(1.233319) - 1)*100)= 243.3%)
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model.
# ***** Fit linear regression models: criminals on 4 explanatory variables.
model2 <- lm(log (price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data= listings)
model2 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 8.059 | 0.060 | 134.115 | 0.000 | 7.941 | 8.177 |
| prop_type_simplifiedBed and breakfast | 0.315 | 0.047 | 6.754 | 0.000 | 0.223 | 0.406 |
| prop_type_simplifiedGuest suite | -0.287 | 0.021 | -13.894 | 0.000 | -0.328 | -0.247 |
| prop_type_simplifiedHouse | 0.369 | 0.016 | 23.534 | 0.000 | 0.338 | 0.399 |
| prop_type_simplifiedOther | 0.094 | 0.016 | 5.918 | 0.000 | 0.063 | 0.125 |
| prop_type_simplifiedVilla | 1.334 | 0.035 | 38.501 | 0.000 | 1.266 | 1.402 |
| number_of_reviews | -0.002 | 0.000 | -9.458 | 0.000 | -0.002 | -0.001 |
| review_scores_rating | 0.004 | 0.001 | 6.790 | 0.000 | 0.003 | 0.006 |
| room_typeHotel room | 0.008 | 0.050 | 0.159 | 0.873 | -0.091 | 0.107 |
| room_typePrivate room | -0.712 | 0.016 | -45.918 | 0.000 | -0.742 | -0.682 |
| room_typeShared room | -1.046 | 0.109 | -9.613 | 0.000 | -1.259 | -0.832 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.267 | 0.266 | 0.613 | 434 | 0 | 10 | -11103 | 22231 | 22320 | 4484 | 11948 | 11959 |
All else equal, room_typePrivate room and room_typeShared room are significant predictors of the cost for 4 night as it has T-statistics >2 and p-vale <0.05.
room_typeHotel room does not seem to be significantly different from the baseline room_type variable (“Entire home/apt”) as its |T statistic| < 2.
Now we will explore adding further variables.
We start our own exploration now. We first wonder if we can improve model 2.
model3 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type, data= listings)
model3 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 8.077 | 0.060 | 133.989 | 0.000 | 7.959 | 8.195 |
| prop_type_simplifiedBed and breakfast | 0.329 | 0.047 | 7.034 | 0.000 | 0.237 | 0.420 |
| prop_type_simplifiedGuest suite | -0.295 | 0.021 | -14.221 | 0.000 | -0.335 | -0.254 |
| prop_type_simplifiedHouse | 0.379 | 0.016 | 24.183 | 0.000 | 0.348 | 0.410 |
| prop_type_simplifiedOther | 0.092 | 0.016 | 5.795 | 0.000 | 0.061 | 0.123 |
| prop_type_simplifiedVilla | 1.347 | 0.035 | 38.767 | 0.000 | 1.279 | 1.416 |
| review_scores_rating | 0.004 | 0.001 | 5.855 | 0.000 | 0.002 | 0.005 |
| room_typeHotel room | 0.025 | 0.051 | 0.504 | 0.614 | -0.074 | 0.125 |
| room_typePrivate room | -0.704 | 0.016 | -45.310 | 0.000 | -0.735 | -0.674 |
| room_typeShared room | -1.023 | 0.109 | -9.375 | 0.000 | -1.237 | -0.809 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.261 | 0.26 | 0.615 | 469 | 0 | 9 | -11148 | 22318 | 22399 | 4518 | 11949 | 11959 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.077277 0.060283 133.99 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.328617 0.046721 7.03 2.1e-12 ***
## prop_type_simplifiedGuest suite -0.294813 0.020731 -14.22 < 2e-16 ***
## prop_type_simplifiedHouse 0.379143 0.015678 24.18 < 2e-16 ***
## prop_type_simplifiedOther 0.092156 0.015904 5.79 7.0e-09 ***
## prop_type_simplifiedVilla 1.347416 0.034756 38.77 < 2e-16 ***
## review_scores_rating 0.003716 0.000635 5.86 4.9e-09 ***
## room_typeHotel room 0.025498 0.050574 0.50 0.61
## room_typePrivate room -0.704154 0.015541 -45.31 < 2e-16 ***
## room_typeShared room -1.023137 0.109138 -9.37 < 2e-16 ***
##
## Residual standard error: 0.615 on 11949 degrees of freedom
## (4605 observations deleted due to missingness)
## Multiple R-squared: 0.261, Adjusted R-squared: 0.26
## F-statistic: 469 on 9 and 11949 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.37 5 1.03
## review_scores_rating 1.02 1 1.01
## room_type 1.36 3 1.05
After taking number_of_reviews away, we find that there is almost no change in adjusted R-squared. Also, there is no change in collinearity and significance after eliminating number_of_reviews. So, we can get rid of number_of_reviews in our model.
Now, let’s take a look at the number of bathrooms, bedrooms, beds and the number of people they can accommodates. We want to know if they are significant in predicting price of 4 nights in Cape Town.
model4 <- lm(log (price_4_nights) ~ prop_type_simplified +
review_scores_rating + room_type + bathrooms + bedrooms + beds +
accommodates, data= listings)
model4 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.443 | 0.053 | 140.283 | 0.000 | 7.339 | 7.547 |
| prop_type_simplifiedBed and breakfast | 0.082 | 0.040 | 2.030 | 0.042 | 0.003 | 0.161 |
| prop_type_simplifiedGuest suite | -0.236 | 0.018 | -13.291 | 0.000 | -0.271 | -0.201 |
| prop_type_simplifiedHouse | 0.016 | 0.015 | 1.088 | 0.277 | -0.013 | 0.044 |
| prop_type_simplifiedOther | 0.007 | 0.014 | 0.484 | 0.629 | -0.020 | 0.033 |
| prop_type_simplifiedVilla | 0.450 | 0.033 | 13.790 | 0.000 | 0.386 | 0.514 |
| review_scores_rating | 0.004 | 0.001 | 7.334 | 0.000 | 0.003 | 0.005 |
| room_typeHotel room | 0.143 | 0.043 | 3.318 | 0.001 | 0.059 | 0.228 |
| room_typePrivate room | -0.384 | 0.014 | -26.812 | 0.000 | -0.412 | -0.356 |
| room_typeShared room | -0.764 | 0.095 | -8.086 | 0.000 | -0.950 | -0.579 |
| bathrooms | 0.126 | 0.008 | 15.452 | 0.000 | 0.110 | 0.142 |
| bedrooms | 0.068 | 0.009 | 7.243 | 0.000 | 0.050 | 0.086 |
| beds | -0.068 | 0.005 | -14.340 | 0.000 | -0.078 | -0.059 |
| accommodates | 0.146 | 0.005 | 31.265 | 0.000 | 0.137 | 0.155 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.462 | 0.462 | 0.525 | 787 | 0 | 13 | -9213 | 18456 | 18567 | 3275 | 11903 | 11917 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.442641 0.053054 140.28 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.081656 0.040226 2.03 0.04239 *
## prop_type_simplifiedGuest suite -0.235741 0.017737 -13.29 < 2e-16 ***
## prop_type_simplifiedHouse 0.015862 0.014580 1.09 0.27667
## prop_type_simplifiedOther 0.006603 0.013649 0.48 0.62852
## prop_type_simplifiedVilla 0.449898 0.032626 13.79 < 2e-16 ***
## review_scores_rating 0.004017 0.000548 7.33 2.4e-13 ***
## room_typeHotel room 0.143330 0.043196 3.32 0.00091 ***
## room_typePrivate room -0.383781 0.014314 -26.81 < 2e-16 ***
## room_typeShared room -0.764245 0.094510 -8.09 6.7e-16 ***
## bathrooms 0.126130 0.008163 15.45 < 2e-16 ***
## bedrooms 0.067896 0.009374 7.24 4.6e-13 ***
## beds -0.068429 0.004772 -14.34 < 2e-16 ***
## accommodates 0.145776 0.004663 31.27 < 2e-16 ***
##
## Residual standard error: 0.525 on 11903 degrees of freedom
## (4647 observations deleted due to missingness)
## Multiple R-squared: 0.462, Adjusted R-squared: 0.462
## F-statistic: 787 on 13 and 11903 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.93 5 1.07
## review_scores_rating 1.02 1 1.01
## room_type 1.62 3 1.08
## bathrooms 3.20 1 1.79
## bedrooms 5.43 1 2.33
## beds 2.97 1 1.72
## accommodates 3.87 1 1.97
Bathrooms, bedrooms, beds and accommodates are all significant predictors of the cost for 4 night as they all have T-statistics >2 and p-vale <0.05. That is to say, the more bathrooms/bedrooms/beds it has or the more people it can accommodates, the more expensive it is.
However, as we mentioned above, the four variables mentioned above are highly correlated with correlation of around 0.7. Considering about collinearity, we can only keep one of those variables. Noticing that accommodates has the biggest T-statistics, we decide to keep it.
Since as stated earlier only 6% of the households in South Africa have air conditioning, we examine how significant it is in predicting the price (Statista, 2020).
model5 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac, data= listings)
model5 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.445 | 0.052 | 143.643 | 0.000 | 7.344 | 7.547 |
| prop_type_simplifiedBed and breakfast | 0.131 | 0.040 | 3.318 | 0.001 | 0.054 | 0.209 |
| prop_type_simplifiedGuest suite | -0.210 | 0.018 | -11.965 | 0.000 | -0.244 | -0.176 |
| prop_type_simplifiedHouse | 0.079 | 0.014 | 5.499 | 0.000 | 0.051 | 0.107 |
| prop_type_simplifiedOther | 0.005 | 0.013 | 0.379 | 0.705 | -0.021 | 0.032 |
| prop_type_simplifiedVilla | 0.546 | 0.032 | 17.173 | 0.000 | 0.484 | 0.609 |
| review_scores_rating | 0.004 | 0.001 | 6.915 | 0.000 | 0.003 | 0.005 |
| room_typeHotel room | 0.090 | 0.043 | 2.101 | 0.036 | 0.006 | 0.174 |
| room_typePrivate room | -0.362 | 0.014 | -25.605 | 0.000 | -0.389 | -0.334 |
| room_typeShared room | -1.073 | 0.092 | -11.651 | 0.000 | -1.253 | -0.892 |
| accommodates | 0.167 | 0.003 | 59.584 | 0.000 | 0.161 | 0.172 |
| acTRUE | 0.326 | 0.011 | 30.214 | 0.000 | 0.305 | 0.347 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.474 | 0.474 | 0.519 | 979 | 0 | 11 | -9115 | 18256 | 18352 | 3215 | 11947 | 11959 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.445308 0.051832 143.64 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.131212 0.039540 3.32 0.00091 ***
## prop_type_simplifiedGuest suite -0.210012 0.017552 -11.97 < 2e-16 ***
## prop_type_simplifiedHouse 0.078885 0.014347 5.50 3.9e-08 ***
## prop_type_simplifiedOther 0.005106 0.013484 0.38 0.70496
## prop_type_simplifiedVilla 0.546296 0.031812 17.17 < 2e-16 ***
## review_scores_rating 0.003707 0.000536 6.91 4.9e-12 ***
## room_typeHotel room 0.089831 0.042765 2.10 0.03570 *
## room_typePrivate room -0.361531 0.014120 -25.60 < 2e-16 ***
## room_typeShared room -1.072942 0.092089 -11.65 < 2e-16 ***
## accommodates 0.166640 0.002797 59.58 < 2e-16 ***
## acTRUE 0.325647 0.010778 30.21 < 2e-16 ***
##
## Residual standard error: 0.519 on 11947 degrees of freedom
## (4605 observations deleted due to missingness)
## Multiple R-squared: 0.474, Adjusted R-squared: 0.474
## F-statistic: 979 on 11 and 11947 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.85 5 1.06
## review_scores_rating 1.02 1 1.01
## room_type 1.58 3 1.08
## accommodates 1.43 1 1.19
## ac 1.04 1 1.02
As we supposed, A/C is a significant predictor of the cost for 4 night as it has T-statistics >2 and p-vale <0.05! As we predict, if there is air conditioning in the property, it will be more expensive to rent.
Also, all variables’ VIFs are smaller than 5, indicating that there is no collinearity. So we can accept A/C in our model.
After discussing about the property itself, we also would like to see if the host is good. We think that superhosts would charge us more since they believe their properties are popular in the market.
model6 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac + host_is_superhost, data= listings)
model6 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.418 | 0.052 | 141.574 | 0.000 | 7.315 | 7.521 |
| prop_type_simplifiedBed and breakfast | 0.130 | 0.040 | 3.285 | 0.001 | 0.052 | 0.207 |
| prop_type_simplifiedGuest suite | -0.204 | 0.018 | -11.570 | 0.000 | -0.239 | -0.169 |
| prop_type_simplifiedHouse | 0.079 | 0.014 | 5.486 | 0.000 | 0.051 | 0.107 |
| prop_type_simplifiedOther | 0.007 | 0.013 | 0.498 | 0.619 | -0.020 | 0.033 |
| prop_type_simplifiedVilla | 0.551 | 0.032 | 17.302 | 0.000 | 0.488 | 0.613 |
| review_scores_rating | 0.004 | 0.001 | 7.514 | 0.000 | 0.003 | 0.005 |
| room_typeHotel room | 0.088 | 0.043 | 2.055 | 0.040 | 0.004 | 0.172 |
| room_typePrivate room | -0.365 | 0.014 | -25.790 | 0.000 | -0.392 | -0.337 |
| room_typeShared room | -1.076 | 0.092 | -11.689 | 0.000 | -1.256 | -0.896 |
| accommodates | 0.166 | 0.003 | 59.377 | 0.000 | 0.161 | 0.172 |
| acTRUE | 0.328 | 0.011 | 30.405 | 0.000 | 0.307 | 0.350 |
| host_is_superhostTRUE | -0.037 | 0.011 | -3.477 | 0.001 | -0.057 | -0.016 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.475 | 0.474 | 0.519 | 899 | 0 | 12 | -9109 | 18246 | 18349 | 3212 | 11946 | 11959 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.418068 0.052397 141.57 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.129826 0.039524 3.28 0.00102 **
## prop_type_simplifiedGuest suite -0.203970 0.017629 -11.57 < 2e-16 ***
## prop_type_simplifiedHouse 0.078666 0.014340 5.49 4.2e-08 ***
## prop_type_simplifiedOther 0.006716 0.013486 0.50 0.61850
## prop_type_simplifiedVilla 0.550570 0.031821 17.30 < 2e-16 ***
## review_scores_rating 0.004128 0.000549 7.51 6.2e-14 ***
## room_typeHotel room 0.087865 0.042749 2.06 0.03986 *
## room_typePrivate room -0.364763 0.014144 -25.79 < 2e-16 ***
## room_typeShared room -1.075955 0.092051 -11.69 < 2e-16 ***
## accommodates 0.166175 0.002799 59.38 < 2e-16 ***
## acTRUE 0.328491 0.010804 30.40 < 2e-16 ***
## host_is_superhostTRUE -0.036640 0.010538 -3.48 0.00051 ***
##
## Residual standard error: 0.519 on 11946 degrees of freedom
## (4605 observations deleted due to missingness)
## Multiple R-squared: 0.475, Adjusted R-squared: 0.474
## F-statistic: 899 on 12 and 11946 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.87 5 1.06
## review_scores_rating 1.07 1 1.03
## room_type 1.59 3 1.08
## accommodates 1.43 1 1.20
## ac 1.05 1 1.02
## host_is_superhost 1.09 1 1.04
Unfortunately, we are wrong. Superhosts don’t charge us more! We can see that is_super_host is NOT a significant predictor of the cost for 4 night as it has T-statistics <2 and p-vale >0.05.
We also wonder would strict cancellation policy be an indicator of the price. Intuitively, the higher the price, the more the host would lose if guests canceled. Is that true for property in Cape Town?
model7 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac + cancellation_policy, data= listings)
model7 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.383 | 0.052 | 143.03 | 0.000 | 7.282 | 7.485 |
| prop_type_simplifiedBed and breakfast | 0.128 | 0.039 | 3.27 | 0.001 | 0.052 | 0.205 |
| prop_type_simplifiedGuest suite | -0.198 | 0.017 | -11.38 | 0.000 | -0.232 | -0.164 |
| prop_type_simplifiedHouse | 0.079 | 0.014 | 5.58 | 0.000 | 0.051 | 0.107 |
| prop_type_simplifiedOther | 0.003 | 0.013 | 0.25 | 0.803 | -0.023 | 0.030 |
| prop_type_simplifiedVilla | 0.533 | 0.032 | 16.92 | 0.000 | 0.471 | 0.595 |
| review_scores_rating | 0.004 | 0.001 | 6.83 | 0.000 | 0.003 | 0.005 |
| room_typeHotel room | 0.087 | 0.042 | 2.06 | 0.040 | 0.004 | 0.170 |
| room_typePrivate room | -0.350 | 0.014 | -24.91 | 0.000 | -0.377 | -0.322 |
| room_typeShared room | -1.038 | 0.091 | -11.38 | 0.000 | -1.217 | -0.859 |
| accommodates | 0.163 | 0.003 | 58.74 | 0.000 | 0.158 | 0.169 |
| acTRUE | 0.320 | 0.011 | 29.95 | 0.000 | 0.299 | 0.341 |
| cancellation_policymoderate | 0.033 | 0.012 | 2.63 | 0.008 | 0.008 | 0.057 |
| cancellation_policystrict | 0.166 | 0.012 | 14.28 | 0.000 | 0.143 | 0.189 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.485 | 0.484 | 0.514 | 864 | 0 | 13 | -8993 | 18015 | 18126 | 3150 | 11945 | 11959 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.383479 0.051621 143.03 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.128266 0.039163 3.28 0.0011 **
## prop_type_simplifiedGuest suite -0.197971 0.017392 -11.38 < 2e-16 ***
## prop_type_simplifiedHouse 0.079175 0.014203 5.57 2.5e-08 ***
## prop_type_simplifiedOther 0.003336 0.013348 0.25 0.8027
## prop_type_simplifiedVilla 0.532964 0.031505 16.92 < 2e-16 ***
## review_scores_rating 0.003627 0.000531 6.83 9.1e-12 ***
## room_typeHotel room 0.087091 0.042352 2.06 0.0398 *
## room_typePrivate room -0.349656 0.014036 -24.91 < 2e-16 ***
## room_typeShared room -1.037813 0.091191 -11.38 < 2e-16 ***
## accommodates 0.163234 0.002779 58.74 < 2e-16 ***
## acTRUE 0.319853 0.010679 29.95 < 2e-16 ***
## cancellation_policymoderate 0.032820 0.012468 2.63 0.0085 **
## cancellation_policystrict 0.166290 0.011643 14.28 < 2e-16 ***
##
## Residual standard error: 0.514 on 11945 degrees of freedom
## (4605 observations deleted due to missingness)
## Multiple R-squared: 0.485, Adjusted R-squared: 0.484
## F-statistic: 864 on 13 and 11945 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.86 5 1.06
## review_scores_rating 1.02 1 1.01
## room_type 1.60 3 1.08
## accommodates 1.44 1 1.20
## ac 1.04 1 1.02
## cancellation_policy 1.04 2 1.01
We can see that cancellation_policy is a significant predictor of the cost for 4 night as it has T-statistics <2 and p-vale >0.05. It is true that host will impose strict cancellation policy if the price of his property is high.
Also, all variables’ VIFs are smaller than 5, indicating that there is no collinearity. So we can accept cancellation_policy in our model.
As mentioned previously, we have two ways to simplify neighbourhood information. Here, we compare two methods (Method 1 in model 8, Method 2 in model 9) and select the best to add to our model.
model8 <- lm(log (price_4_nights) ~ prop_type_simplified + review_scores_rating + room_type + accommodates + ac + cancellation_policy + neighbourhood_simplified, data= listings)
model8 %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.529 | 0.051 | 148.910 | 0.000 | 7.430 | 7.628 |
| prop_type_simplifiedBed and breakfast | 0.166 | 0.038 | 4.334 | 0.000 | 0.091 | 0.240 |
| prop_type_simplifiedGuest suite | -0.150 | 0.017 | -8.785 | 0.000 | -0.183 | -0.116 |
| prop_type_simplifiedHouse | 0.091 | 0.014 | 6.567 | 0.000 | 0.064 | 0.118 |
| prop_type_simplifiedOther | 0.010 | 0.013 | 0.798 | 0.425 | -0.015 | 0.036 |
| prop_type_simplifiedVilla | 0.491 | 0.031 | 15.970 | 0.000 | 0.431 | 0.551 |
| review_scores_rating | 0.003 | 0.001 | 5.427 | 0.000 | 0.002 | 0.004 |
| room_typeHotel room | 0.080 | 0.041 | 1.945 | 0.052 | -0.001 | 0.161 |
| room_typePrivate room | -0.313 | 0.014 | -22.818 | 0.000 | -0.340 | -0.286 |
| room_typeShared room | -1.084 | 0.089 | -12.214 | 0.000 | -1.258 | -0.910 |
| accommodates | 0.167 | 0.003 | 61.687 | 0.000 | 0.162 | 0.173 |
| acTRUE | 0.305 | 0.010 | 29.197 | 0.000 | 0.285 | 0.326 |
| cancellation_policymoderate | 0.012 | 0.012 | 0.994 | 0.320 | -0.012 | 0.036 |
| cancellation_policystrict | 0.137 | 0.011 | 12.057 | 0.000 | 0.115 | 0.160 |
| neighbourhood_simplifiedbetter | -0.183 | 0.014 | -13.270 | 0.000 | -0.210 | -0.156 |
| neighbourhood_simplifiedgood | -0.308 | 0.013 | -23.148 | 0.000 | -0.334 | -0.282 |
| neighbourhood_simplifiedstandard | -0.523 | 0.056 | -9.250 | 0.000 | -0.633 | -0.412 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.512 | 0.512 | 0.5 | 785 | 0 | 16 | -8661 | 17359 | 17492 | 2981 | 11942 | 11959 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.528936 0.050560 148.91 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.165522 0.038188 4.33 1.5e-05 ***
## prop_type_simplifiedGuest suite -0.149805 0.017053 -8.78 < 2e-16 ***
## prop_type_simplifiedHouse 0.091091 0.013870 6.57 5.3e-11 ***
## prop_type_simplifiedOther 0.010381 0.013009 0.80 0.425
## prop_type_simplifiedVilla 0.490780 0.030731 15.97 < 2e-16 ***
## review_scores_rating 0.002810 0.000518 5.43 5.8e-08 ***
## room_typeHotel room 0.080174 0.041220 1.95 0.052 .
## room_typePrivate room -0.313203 0.013726 -22.82 < 2e-16 ***
## room_typeShared room -1.083994 0.088748 -12.21 < 2e-16 ***
## accommodates 0.167380 0.002713 61.69 < 2e-16 ***
## acTRUE 0.305286 0.010456 29.20 < 2e-16 ***
## cancellation_policymoderate 0.012087 0.012156 0.99 0.320
## cancellation_policystrict 0.137232 0.011382 12.06 < 2e-16 ***
## neighbourhood_simplifiedbetter -0.183127 0.013800 -13.27 < 2e-16 ***
## neighbourhood_simplifiedgood -0.308200 0.013315 -23.15 < 2e-16 ***
## neighbourhood_simplifiedstandard -0.522629 0.056499 -9.25 < 2e-16 ***
##
## Residual standard error: 0.5 on 11942 degrees of freedom
## (4605 observations deleted due to missingness)
## Multiple R-squared: 0.512, Adjusted R-squared: 0.512
## F-statistic: 785 on 16 and 11942 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.91 5 1.07
## review_scores_rating 1.02 1 1.01
## room_type 1.62 3 1.08
## accommodates 1.45 1 1.20
## ac 1.06 1 1.03
## cancellation_policy 1.05 2 1.01
## neighbourhood_simplified 1.09 3 1.01
So, here is our final model. It includes the following explanatory variables: prop_type_simplified, review_scores_rating, room_type, accommodates, ac, cancellation_policy, and neighbourhood_simplified.
model_final <- lm(log (price_4_nights) ~ prop_type_simplified +
review_scores_rating + room_type + accommodates + ac +
cancellation_policy + neighbourhood_simplified, data= listings)
model_final %>%
broom::tidy(conf.int=TRUE)%>%
kable()%>%
kable_styling()| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.529 | 0.051 | 148.910 | 0.000 | 7.430 | 7.628 |
| prop_type_simplifiedBed and breakfast | 0.166 | 0.038 | 4.334 | 0.000 | 0.091 | 0.240 |
| prop_type_simplifiedGuest suite | -0.150 | 0.017 | -8.785 | 0.000 | -0.183 | -0.116 |
| prop_type_simplifiedHouse | 0.091 | 0.014 | 6.567 | 0.000 | 0.064 | 0.118 |
| prop_type_simplifiedOther | 0.010 | 0.013 | 0.798 | 0.425 | -0.015 | 0.036 |
| prop_type_simplifiedVilla | 0.491 | 0.031 | 15.970 | 0.000 | 0.431 | 0.551 |
| review_scores_rating | 0.003 | 0.001 | 5.427 | 0.000 | 0.002 | 0.004 |
| room_typeHotel room | 0.080 | 0.041 | 1.945 | 0.052 | -0.001 | 0.161 |
| room_typePrivate room | -0.313 | 0.014 | -22.818 | 0.000 | -0.340 | -0.286 |
| room_typeShared room | -1.084 | 0.089 | -12.214 | 0.000 | -1.258 | -0.910 |
| accommodates | 0.167 | 0.003 | 61.687 | 0.000 | 0.162 | 0.173 |
| acTRUE | 0.305 | 0.010 | 29.197 | 0.000 | 0.285 | 0.326 |
| cancellation_policymoderate | 0.012 | 0.012 | 0.994 | 0.320 | -0.012 | 0.036 |
| cancellation_policystrict | 0.137 | 0.011 | 12.057 | 0.000 | 0.115 | 0.160 |
| neighbourhood_simplifiedbetter | -0.183 | 0.014 | -13.270 | 0.000 | -0.210 | -0.156 |
| neighbourhood_simplifiedgood | -0.308 | 0.013 | -23.148 | 0.000 | -0.334 | -0.282 |
| neighbourhood_simplifiedstandard | -0.523 | 0.056 | -9.250 | 0.000 | -0.633 | -0.412 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.512 | 0.512 | 0.5 | 785 | 0 | 16 | -8661 | 17359 | 17492 | 2981 | 11942 | 11959 |
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.528936 0.050560 148.91 < 2e-16 ***
## prop_type_simplifiedBed and breakfast 0.165522 0.038188 4.33 1.5e-05 ***
## prop_type_simplifiedGuest suite -0.149805 0.017053 -8.78 < 2e-16 ***
## prop_type_simplifiedHouse 0.091091 0.013870 6.57 5.3e-11 ***
## prop_type_simplifiedOther 0.010381 0.013009 0.80 0.425
## prop_type_simplifiedVilla 0.490780 0.030731 15.97 < 2e-16 ***
## review_scores_rating 0.002810 0.000518 5.43 5.8e-08 ***
## room_typeHotel room 0.080174 0.041220 1.95 0.052 .
## room_typePrivate room -0.313203 0.013726 -22.82 < 2e-16 ***
## room_typeShared room -1.083994 0.088748 -12.21 < 2e-16 ***
## accommodates 0.167380 0.002713 61.69 < 2e-16 ***
## acTRUE 0.305286 0.010456 29.20 < 2e-16 ***
## cancellation_policymoderate 0.012087 0.012156 0.99 0.320
## cancellation_policystrict 0.137232 0.011382 12.06 < 2e-16 ***
## neighbourhood_simplifiedbetter -0.183127 0.013800 -13.27 < 2e-16 ***
## neighbourhood_simplifiedgood -0.308200 0.013315 -23.15 < 2e-16 ***
## neighbourhood_simplifiedstandard -0.522629 0.056499 -9.25 < 2e-16 ***
##
## Residual standard error: 0.5 on 11942 degrees of freedom
## (4605 observations deleted due to missingness)
## Multiple R-squared: 0.512, Adjusted R-squared: 0.512
## F-statistic: 785 on 16 and 11942 DF, p-value: <2e-16
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.91 5 1.07
## review_scores_rating 1.02 1 1.01
## room_type 1.62 3 1.08
## accommodates 1.45 1 1.20
## ac 1.06 1 1.03
## cancellation_policy 1.05 2 1.01
## neighbourhood_simplified 1.09 3 1.01
Below, we interpret the final model coefficients:
cancellation_policy For every 1 unit increase in review_scores_rating, there is an increase of 0.28% in price_4_nights (Equivalent to (exp(0.00281) - 1)*100)
property_type Price for 4 nights at a Bed and Breakfast is on average 18.0% higher than the price for an apartment. (Equivalent to (exp(0.16552) - 1)*100) = (18.0%) Price for 4 nights at a Guest Suite is on average 13.9% cheaper than the price for an apartment. (Equivalent to (exp(-0.14981) - 1)100) = -13.9%) Price for 4 nights at a House is on average 9.5% higher than the price for an apartment. (Equivalent to (exp(0.09109) - 1)100) = 9.5%) Price for 4 nights at a Villa is on average 63.4% higher** than the price for an apartment. (Equivalent to (exp(0.49078) - 1)*100)= 63.4%)
cancellation_policy The price of a property with a moderate cancellation policy does not differ significantly from the price for a property with a flexible cancellation policy. The price of a property with a strict cancellation policy is on average 14.7% higher than the price for a property with a flexible cancellation policy. (Equivalent to (exp(0.13723) - 1)*100)= 14.7%)
room_type The price of a Hotel room is on average 8.3% higher than the price for an Entire home or apartment. (Equivalent to (exp(0.08017) - 1)100)= 8.3%). However, the price is significant only at a 10% significance level. The price of a Private room is on average 26.8% lower than the price for an Entire home or apartment. (Equivalent to (exp(-0.31320) - 1)100)= -26.8%) The price of a Shared room is on average 66.2% lower than the price for an Entire home or apartment. (Equivalent to (exp(-1.08399) - 1)*100)= -66.2%)
accomodates For each additional person a property accommodates, the average increase in price for 4 nights is 18.2%. (Equivalent to (exp(0.16738) - 1)*100)= 18.2%)
neighbourhood_simplified The price of a property in a better neighbourhood is on average 16.7% lower than the price of a property in a best neighbourhood. (Equivalent to (exp(-0.18313) - 1)100)= -16.7%) The price of a property in a good neighbourhood is on average 26.5% lower than the price of a property in a best neighbourhood. (Equivalent to (exp(-0.30820) - 1)100)= -26.5%) The price of a property in a standard neighbourhood is on average 40.7% lower than the price of a property in a best neighbourhood. (Equivalent to (exp(-0.52263) - 1)*100)= -40.7%)
ac The price of a property with an ac is on average 35.7% higher than the price of a property without one. (Equivalent to (exp(0.30529) - 1)*100)= 35.7%)
#creating comparison table
huxreg(model1,model2, model3, model4, model5, model6, model7, model_final,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05,
stars = NULL
) %>%
set_caption('Comparison of models')| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 7.824 | 8.059 | 8.077 | 7.443 | 7.445 | 7.418 | 7.383 | 7.529 |
| (0.065) | (0.060) | (0.060) | (0.053) | (0.052) | (0.052) | (0.052) | (0.051) | |
| prop_type_simplifiedBed and breakfast | -0.107 | 0.315 | 0.329 | 0.082 | 0.131 | 0.130 | 0.128 | 0.166 |
| (0.046) | (0.047) | (0.047) | (0.040) | (0.040) | (0.040) | (0.039) | (0.038) | |
| prop_type_simplifiedGuest suite | -0.422 | -0.287 | -0.295 | -0.236 | -0.210 | -0.204 | -0.198 | -0.150 |
| (0.022) | (0.021) | (0.021) | (0.018) | (0.018) | (0.018) | (0.017) | (0.017) | |
| prop_type_simplifiedHouse | 0.130 | 0.369 | 0.379 | 0.016 | 0.079 | 0.079 | 0.079 | 0.091 |
| (0.016) | (0.016) | (0.016) | (0.015) | (0.014) | (0.014) | (0.014) | (0.014) | |
| prop_type_simplifiedOther | -0.000 | 0.094 | 0.092 | 0.007 | 0.005 | 0.007 | 0.003 | 0.010 |
| (0.017) | (0.016) | (0.016) | (0.014) | (0.013) | (0.013) | (0.013) | (0.013) | |
| prop_type_simplifiedVilla | 1.233 | 1.334 | 1.347 | 0.450 | 0.546 | 0.551 | 0.533 | 0.491 |
| (0.038) | (0.035) | (0.035) | (0.033) | (0.032) | (0.032) | (0.032) | (0.031) | |
| number_of_reviews | -0.001 | -0.002 | ||||||
| (0.000) | (0.000) | |||||||
| review_scores_rating | 0.006 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.003 |
| (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | |
| room_typeHotel room | 0.008 | 0.025 | 0.143 | 0.090 | 0.088 | 0.087 | 0.080 | |
| (0.050) | (0.051) | (0.043) | (0.043) | (0.043) | (0.042) | (0.041) | ||
| room_typePrivate room | -0.712 | -0.704 | -0.384 | -0.362 | -0.365 | -0.350 | -0.313 | |
| (0.016) | (0.016) | (0.014) | (0.014) | (0.014) | (0.014) | (0.014) | ||
| room_typeShared room | -1.046 | -1.023 | -0.764 | -1.073 | -1.076 | -1.038 | -1.084 | |
| (0.109) | (0.109) | (0.095) | (0.092) | (0.092) | (0.091) | (0.089) | ||
| bathrooms | 0.126 | |||||||
| (0.008) | ||||||||
| bedrooms | 0.068 | |||||||
| (0.009) | ||||||||
| beds | -0.068 | |||||||
| (0.005) | ||||||||
| accommodates | 0.146 | 0.167 | 0.166 | 0.163 | 0.167 | |||
| (0.005) | (0.003) | (0.003) | (0.003) | (0.003) | ||||
| acTRUE | 0.326 | 0.328 | 0.320 | 0.305 | ||||
| (0.011) | (0.011) | (0.011) | (0.010) | |||||
| host_is_superhostTRUE | -0.037 | |||||||
| (0.011) | ||||||||
| cancellation_policymoderate | 0.033 | 0.012 | ||||||
| (0.012) | (0.012) | |||||||
| cancellation_policystrict | 0.166 | 0.137 | ||||||
| (0.012) | (0.011) | |||||||
| neighbourhood_simplifiedbetter | -0.183 | |||||||
| (0.014) | ||||||||
| neighbourhood_simplifiedgood | -0.308 | |||||||
| (0.013) | ||||||||
| neighbourhood_simplifiedstandard | -0.523 | |||||||
| (0.056) | ||||||||
| #observations | 11959 | 11959 | 11959 | 11917 | 11959 | 11959 | 11959 | 11959 |
| R squared | 0.131 | 0.267 | 0.261 | 0.462 | 0.474 | 0.475 | 0.485 | 0.512 |
| Adj. R Squared | 0.130 | 0.266 | 0.260 | 0.462 | 0.474 | 0.474 | 0.484 | 0.512 |
| Residual SE | 0.667 | 0.613 | 0.615 | 0.525 | 0.519 | 0.519 | 0.514 | 0.500 |
This table shows all of the coefficients and R squared of the models that we fit. Our final model has the highest R2 out of the models that we fit.
We can see from “Residual Vs Fitted” that residuals are random, with no pattern and around Y=0, indicating that the the regression follows the linearity assumption.
In “Normal Q-Q” plot, we notice that residuals don’t deviate from the straight line, so we can confirm that our regression follows the normality assumption.
Checking on “Scale-Location” plot, we find that residuals are stable at around 1 when fitted values are around 7 to 10. It is acceptable to say that our regression satisfies the Equal Variance assumption in general.
Lastly, we look at “Residuals Vs Leverage” plot. We can clearly see that leverage is smaller than 0.5, indicating that most data points don’t have unusual values of predictors.
Based on Airbnb data, the best predictors of the price for 2 persons for 4 nights are: property type, room type, review ratings scores, how strict the cancellation policy is, how many people the property accommodates, which neighbourhood the property is in, and if the property has air conditioning installed.
We find that the price of booking a Villa is the highest, whereas the price of booking a Guest suite is the lowest. Additionally, we find it interesting that the price of a Bed and Breakfast is on average higher than the price of an Apartment. Furthermore, it seems that the stricter the cancellation policy the higher the price. In addition, the model shows that the price varies significantly based on which neighbourhood the property is located in. Better neighbourhoods have a higher price on average, and so do the properties that accommodate more people. To conclude, since the average temperatures in Cape Town are high, having Air conditioning installed is associated with an average price increase of 30%.
We will now use our final model to predict the total cost to stay at an Airbnb for 4 nights, filtering for listings that have at least 10 reviews and an average rating of at least 90.
Preparing the testing data set:
#First we create a new data set accoring to the conditions mentioned above
listings_test_filtered <- listings %>%
filter(property_type == "Apartment",
room_type == "Private room",
number_of_reviews >= 10,
review_scores_rating >= 90
)
# A quick look at the dimensions of the data frame shows us that we are left with 128 observations that fulfill those conditions
dim(listings_test_filtered)## [1] 128 45
Predicting prices for our test listings:
#listings_test_filtered <- listings_test_filtered %>%
# add_predictions(model3, var="predict_price")
#using the predict function to get create a dataframe that contains the point estimate as well as the prediction interval
prediction_results <- predict(model_final, newdata = listings_test_filtered, interval = c("prediction"))
#as all estimates are still in log, we have to reverse that
prediction_results <-exp(prediction_results)
#we have a quick look at the results table
prediction_results%>%
kable()%>%
kable_styling()| fit | lwr | upr |
|---|---|---|
| 2519 | 946 | 6710 |
| 2514 | 944 | 6697 |
| 2521 | 946 | 6715 |
| 2889 | 1085 | 7697 |
| 2505 | 940 | 6672 |
| 2535 | 952 | 6753 |
| 2865 | 1076 | 7632 |
| 2528 | 949 | 6734 |
| 2849 | 1070 | 7589 |
| 3383 | 1270 | 9012 |
| 2512 | 943 | 6691 |
| 2889 | 1085 | 7697 |
| 2817 | 1058 | 7504 |
| 2881 | 1082 | 7675 |
| 2549 | 957 | 6792 |
| 3450 | 1295 | 9191 |
| 2873 | 1079 | 7653 |
| 2535 | 952 | 6753 |
| 2535 | 952 | 6753 |
| 2889 | 1085 | 7697 |
| 2535 | 952 | 6753 |
| 2881 | 1082 | 7675 |
| 2479 | 931 | 6607 |
| 2857 | 1073 | 7611 |
| 2512 | 943 | 6691 |
| 2881 | 1082 | 7675 |
| 2841 | 1067 | 7568 |
| 2486 | 933 | 6622 |
| 2528 | 949 | 6734 |
| 3421 | 1284 | 9114 |
| 2865 | 1076 | 7632 |
| 2491 | 935 | 6635 |
| 2873 | 1079 | 7653 |
| 2507 | 941 | 6678 |
| 2505 | 940 | 6672 |
| 2881 | 1082 | 7675 |
| 3421 | 1284 | 9114 |
| 2535 | 952 | 6753 |
| 3877 | 1455 | 10329 |
| 2111 | 792 | 5625 |
| 2505 | 940 | 6672 |
| 2479 | 931 | 6603 |
| 2820 | 1058 | 7515 |
| 2542 | 954 | 6772 |
| 2542 | 954 | 6772 |
| 2865 | 1075 | 7634 |
| 2881 | 1081 | 7677 |
| 2833 | 1064 | 7547 |
| 2549 | 957 | 6792 |
| 3421 | 1284 | 9114 |
| 2841 | 1067 | 7568 |
| 3899 | 1463 | 10387 |
| 3431 | 1288 | 9140 |
| 2111 | 792 | 5625 |
| 2099 | 788 | 5593 |
| 2507 | 941 | 6678 |
| 2889 | 1085 | 7697 |
| 2841 | 1067 | 7568 |
| 2978 | 1118 | 7933 |
| 2873 | 1079 | 7653 |
| 2849 | 1070 | 7589 |
| 2849 | 1070 | 7589 |
| 2528 | 949 | 6734 |
| 2514 | 944 | 6697 |
| 2505 | 940 | 6672 |
| 2865 | 1076 | 7632 |
| 1858 | 697 | 4949 |
| 2857 | 1073 | 7611 |
| 2111 | 792 | 5625 |
| 3910 | 1468 | 10416 |
| 3373 | 1266 | 8987 |
| 2833 | 1064 | 7547 |
| 2068 | 776 | 5510 |
| 2849 | 1070 | 7589 |
| 2889 | 1085 | 7697 |
| 2500 | 938 | 6659 |
| 2498 | 938 | 6653 |
| 2873 | 1079 | 7653 |
| 2849 | 1070 | 7589 |
| 2535 | 952 | 6753 |
| 2519 | 946 | 6710 |
| 3333 | 1251 | 8879 |
| 2507 | 941 | 6678 |
| 2449 | 919 | 6524 |
| 2512 | 943 | 6691 |
| 2498 | 938 | 6653 |
| 3174 | 1191 | 8459 |
| 2528 | 949 | 6734 |
| 1507 | 562 | 4040 |
| 2889 | 1085 | 7697 |
| 2825 | 1061 | 7525 |
| 2865 | 1076 | 7632 |
| 2528 | 949 | 6734 |
| 2528 | 949 | 6734 |
| 2477 | 930 | 6597 |
| 1847 | 693 | 4921 |
| 3361 | 1262 | 8954 |
| 1815 | 681 | 4835 |
| 2489 | 934 | 6631 |
| 2873 | 1079 | 7653 |
| 2809 | 1055 | 7483 |
| 2500 | 938 | 6659 |
| 2857 | 1073 | 7611 |
| 2105 | 790 | 5609 |
| 2881 | 1082 | 7675 |
| 2519 | 946 | 6710 |
| 1815 | 681 | 4835 |
| 2111 | 792 | 5625 |
| 2841 | 1067 | 7568 |
| 2817 | 1058 | 7504 |
| 2500 | 938 | 6659 |
| 2873 | 1079 | 7653 |
| 2865 | 1076 | 7632 |
| 2406 | 903 | 6410 |
| 2809 | 1055 | 7483 |
| 2080 | 781 | 5541 |
| 2542 | 954 | 6772 |
| 2549 | 957 | 6792 |
| 2491 | 935 | 6635 |
| 2535 | 952 | 6753 |
| 2549 | 957 | 6792 |
| 2549 | 957 | 6792 |
| 2484 | 932 | 6616 |
| 1835 | 689 | 4890 |
| 2074 | 778 | 5526 |
| 1840 | 691 | 4903 |
| 2505 | 940 | 6672 |
| 1832 | 687 | 4880 |
Merging our prediction results with the the listing test data set:
listings_test_filtered_2 <- listings_test_filtered %>%
#keeping only columns which are relevant for further analysis
select(-c(2:43, 45)) %>%
#adding the estimates, upper and lower bounds
mutate(pred_price_fit = prediction_results[,1],
pred_price_lwr = prediction_results[,2],
pred_price_upr = prediction_results[,3],
#creates new variables containing diffrence between estimate and actual value
pred_difference = pred_price_fit - price_4_nights,
#crates new variable that checks how far off the actual values are from the mean price of the overall data set (mean = 7,799 $)
difference_mean = mean(listings$price_4_nights) - price_4_nights
)
listings_test_filtered_2%>%
kable()%>%
kable_styling()| id | price_4_nights | pred_price_fit | pred_price_lwr | pred_price_upr | pred_difference | difference_mean |
|---|---|---|---|---|---|---|
| 390198 | 3288 | 2519 | 946 | 6710 | -769.2 | 4511 |
| 667072 | 5750 | 2514 | 944 | 6697 | -3236.1 | 2049 |
| 3583417 | 1099 | 2521 | 946 | 6715 | 1421.9 | 6700 |
| 4034553 | 4976 | 2889 | 1085 | 7697 | -2086.7 | 2823 |
| 4135729 | 2828 | 2505 | 940 | 6672 | -323.3 | 4971 |
| 4402540 | 2868 | 2535 | 952 | 6753 | -332.9 | 4931 |
| 4577312 | 3370 | 2865 | 1076 | 7632 | -505.0 | 4429 |
| 4613660 | 2354 | 2528 | 949 | 6734 | 174.0 | 5445 |
| 4737256 | 4420 | 2849 | 1070 | 7589 | -1571.0 | 3379 |
| 4797434 | 2178 | 3383 | 1270 | 9012 | 1204.7 | 5621 |
| 5081314 | 3758 | 2512 | 943 | 6691 | -1246.3 | 4041 |
| 5107965 | 3361 | 2889 | 1085 | 7697 | -471.7 | 4438 |
| 5119748 | 3052 | 2817 | 1058 | 7504 | -234.9 | 4747 |
| 5325479 | 3865 | 2881 | 1082 | 7675 | -983.8 | 3934 |
| 5601379 | 1980 | 2549 | 957 | 6792 | 569.4 | 5819 |
| 5923679 | 1596 | 3450 | 1295 | 9191 | 1853.9 | 6203 |
| 6070723 | 2006 | 2873 | 1079 | 7653 | 867.1 | 5793 |
| 6193802 | 2972 | 2535 | 952 | 6753 | -436.9 | 4827 |
| 6235176 | 1886 | 2535 | 952 | 6753 | 649.1 | 5913 |
| 6521201 | 4274 | 2889 | 1085 | 7697 | -1384.7 | 3525 |
| 6918629 | 4050 | 2535 | 952 | 6753 | -1514.9 | 3749 |
| 7449161 | 6936 | 2881 | 1082 | 7675 | -4054.8 | 863 |
| 7608038 | 2904 | 2479 | 931 | 6607 | -424.5 | 4895 |
| 8076245 | 3772 | 2857 | 1073 | 7611 | -915.0 | 4027 |
| 8319060 | 2196 | 2512 | 943 | 6691 | 315.7 | 5603 |
| 8479864 | 4128 | 2881 | 1082 | 7675 | -1246.8 | 3671 |
| 8626291 | 1776 | 2841 | 1067 | 7568 | 1065.0 | 6023 |
| 9603512 | 2600 | 2486 | 933 | 6622 | -114.2 | 5199 |
| 9730452 | 2796 | 2528 | 949 | 6734 | -268.0 | 5003 |
| 9869093 | 4274 | 3421 | 1284 | 9114 | -853.1 | 3525 |
| 10118408 | 3268 | 2865 | 1076 | 7632 | -403.0 | 4531 |
| 10171027 | 1388 | 2491 | 935 | 6635 | 1102.6 | 6411 |
| 10209764 | 2516 | 2873 | 1079 | 7653 | 357.1 | 5283 |
| 10370371 | 2320 | 2507 | 941 | 6678 | 186.8 | 5479 |
| 10377262 | 1576 | 2505 | 940 | 6672 | 928.7 | 6223 |
| 10377744 | 2604 | 2881 | 1082 | 7675 | 277.2 | 5195 |
| 10665293 | 2087 | 3421 | 1284 | 9114 | 1333.9 | 5712 |
| 11004889 | 1500 | 2535 | 952 | 6753 | 1035.1 | 6299 |
| 11561268 | 2416 | 3877 | 1455 | 10329 | 1461.0 | 5383 |
| 11613223 | 2568 | 2111 | 792 | 5625 | -456.9 | 5231 |
| 12047000 | 2200 | 2505 | 940 | 6672 | 304.7 | 5599 |
| 12311907 | 1538 | 2479 | 931 | 6603 | 940.8 | 6261 |
| 12879874 | 1447 | 2820 | 1058 | 7515 | 1373.3 | 6352 |
| 13390581 | 5092 | 2542 | 954 | 6772 | -2549.7 | 2707 |
| 13424912 | 2212 | 2542 | 954 | 6772 | 330.3 | 5587 |
| 13427320 | 1946 | 2865 | 1075 | 7634 | 918.7 | 5853 |
| 13556565 | 4278 | 2881 | 1081 | 7677 | -1397.1 | 3521 |
| 13589317 | 1804 | 2833 | 1064 | 7547 | 1029.0 | 5995 |
| 14001895 | 4324 | 2549 | 957 | 6792 | -1774.6 | 3475 |
| 14099119 | 2141 | 3421 | 1284 | 9114 | 1279.9 | 5658 |
| 14174713 | 2196 | 2841 | 1067 | 7568 | 645.0 | 5603 |
| 14174782 | 1388 | 3899 | 1463 | 10387 | 2510.8 | 6411 |
| 14197317 | 2212 | 3431 | 1288 | 9140 | 1218.6 | 5587 |
| 14205736 | 1946 | 2111 | 792 | 5625 | 165.1 | 5853 |
| 14205988 | 1530 | 2099 | 788 | 5593 | 569.2 | 6269 |
| 14295791 | 2568 | 2507 | 941 | 6678 | -61.2 | 5231 |
| 14904443 | 2162 | 2889 | 1085 | 7697 | 727.3 | 5637 |
| 15032356 | 2808 | 2841 | 1067 | 7568 | 33.0 | 4991 |
| 15513609 | 2104 | 2978 | 1118 | 7933 | 873.7 | 5695 |
| 15650593 | 4784 | 2873 | 1079 | 7653 | -1910.9 | 3015 |
| 15793161 | 2220 | 2849 | 1070 | 7589 | 629.0 | 5579 |
| 15827702 | 1596 | 2849 | 1070 | 7589 | 1253.0 | 6203 |
| 15986335 | 3054 | 2528 | 949 | 6734 | -526.0 | 4745 |
| 16022861 | 2012 | 2514 | 944 | 6697 | 501.9 | 5787 |
| 16357189 | 6032 | 2505 | 940 | 6672 | -3527.3 | 1767 |
| 16425395 | 2204 | 2865 | 1076 | 7632 | 661.0 | 5595 |
| 16492890 | 2288 | 1858 | 697 | 4949 | -430.5 | 5511 |
| 16715956 | 2004 | 2857 | 1073 | 7611 | 853.0 | 5795 |
| 16762435 | 1868 | 2111 | 792 | 5625 | 242.9 | 5931 |
| 16891977 | 2789 | 3910 | 1468 | 10416 | 1120.8 | 5010 |
| 17289226 | 2262 | 3373 | 1266 | 8987 | 1111.2 | 5537 |
| 17487415 | 2288 | 2833 | 1064 | 7547 | 545.0 | 5511 |
| 17563634 | 2568 | 2068 | 776 | 5510 | -500.0 | 5231 |
| 17571972 | 2212 | 2849 | 1070 | 7589 | 637.0 | 5587 |
| 17915175 | 2144 | 2889 | 1085 | 7697 | 745.3 | 5655 |
| 18107701 | 1174 | 2500 | 938 | 6659 | 1325.8 | 6625 |
| 18150180 | 2080 | 2498 | 938 | 6653 | 417.7 | 5719 |
| 18151764 | 2022 | 2873 | 1079 | 7653 | 851.1 | 5777 |
| 18424494 | 1540 | 2849 | 1070 | 7589 | 1309.0 | 6259 |
| 18698721 | 2444 | 2535 | 952 | 6753 | 91.1 | 5355 |
| 18998259 | 2004 | 2519 | 946 | 6710 | 514.8 | 5795 |
| 19093535 | 4946 | 3333 | 1251 | 8879 | -1613.3 | 2853 |
| 19395647 | 2448 | 2507 | 941 | 6678 | 58.8 | 5351 |
| 19513486 | 3952 | 2449 | 919 | 6524 | -1503.0 | 3847 |
| 19543437 | 1746 | 2512 | 943 | 6691 | 765.7 | 6053 |
| 19668020 | 832 | 2498 | 938 | 6653 | 1665.7 | 6967 |
| 20026625 | 1824 | 3174 | 1191 | 8459 | 1350.2 | 5975 |
| 20061001 | 1976 | 2528 | 949 | 6734 | 552.0 | 5823 |
| 20578280 | 1463 | 1507 | 562 | 4040 | 44.5 | 6336 |
| 20595620 | 2738 | 2889 | 1085 | 7697 | 151.3 | 5061 |
| 20894159 | 1180 | 2825 | 1061 | 7525 | 1645.1 | 6619 |
| 20927864 | 2074 | 2865 | 1076 | 7632 | 791.0 | 5725 |
| 21136471 | 1388 | 2528 | 949 | 6734 | 1140.0 | 6411 |
| 21184207 | 3222 | 2528 | 949 | 6734 | -694.0 | 4577 |
| 21453143 | 2412 | 2477 | 930 | 6597 | 64.7 | 5387 |
| 21547885 | 1444 | 1847 | 693 | 4921 | 403.1 | 6355 |
| 21572032 | 4875 | 3361 | 1262 | 8954 | -1514.1 | 2924 |
| 21743669 | 1520 | 1815 | 681 | 4835 | 294.7 | 6279 |
| 21775447 | 1888 | 2489 | 934 | 6631 | 600.7 | 5911 |
| 21894479 | 3172 | 2873 | 1079 | 7653 | -298.9 | 4627 |
| 22037476 | 2797 | 2809 | 1055 | 7483 | 12.2 | 5002 |
| 22512543 | 1732 | 2500 | 938 | 6659 | 767.8 | 6067 |
| 22984673 | 2052 | 2857 | 1073 | 7611 | 805.0 | 5747 |
| 23240465 | 2008 | 2105 | 790 | 5609 | 97.1 | 5791 |
| 23429658 | 2464 | 2881 | 1082 | 7675 | 417.2 | 5335 |
| 23460305 | 1174 | 2519 | 946 | 6710 | 1344.8 | 6625 |
| 24172166 | 1528 | 1815 | 681 | 4835 | 286.7 | 6271 |
| 24887007 | 2220 | 2111 | 792 | 5625 | -109.1 | 5579 |
| 24999034 | 2299 | 2841 | 1067 | 7568 | 542.0 | 5500 |
| 26437576 | 3372 | 2817 | 1058 | 7504 | -554.9 | 4427 |
| 26463385 | 2680 | 2500 | 938 | 6659 | -180.2 | 5119 |
| 27039679 | 1904 | 2873 | 1079 | 7653 | 969.1 | 5895 |
| 27368942 | 2430 | 2865 | 1076 | 7632 | 435.0 | 5369 |
| 27595742 | 3148 | 2406 | 903 | 6410 | -742.2 | 4651 |
| 27943381 | 2307 | 2809 | 1055 | 7483 | 502.2 | 5492 |
| 28200071 | 2776 | 2080 | 781 | 5541 | -696.3 | 5023 |
| 30557302 | 2220 | 2542 | 954 | 6772 | 322.3 | 5579 |
| 31287691 | 2152 | 2549 | 957 | 6792 | 397.4 | 5647 |
| 31478216 | 1892 | 2491 | 935 | 6635 | 598.6 | 5907 |
| 31890053 | 2247 | 2535 | 952 | 6753 | 288.1 | 5552 |
| 34078010 | 2000 | 2549 | 957 | 6792 | 549.4 | 5799 |
| 34348912 | 1988 | 2549 | 957 | 6792 | 561.4 | 5811 |
| 34558453 | 3020 | 2484 | 932 | 6616 | -536.3 | 4779 |
| 36925349 | 1388 | 1835 | 689 | 4890 | 447.2 | 6411 |
| 37439353 | 2143 | 2074 | 778 | 5526 | -69.1 | 5656 |
| 38962365 | 1814 | 1840 | 691 | 4903 | 26.4 | 5985 |
| 39934648 | 2122 | 2505 | 940 | 6672 | 382.7 | 5677 |
| 39968107 | 1463 | 1832 | 687 | 4880 | 368.6 | 6336 |
Summary Statistics:
In this section we will briefly run through some summary statistics to get a feel for how well we predicted the actual price for 4 nights.
# In this table we compare the average 4 night price of the testing data set with the average of our predictions
listings_test_filtered_2 %>%
summarize(mean_pred = mean(pred_price_fit),
mean_actual = mean(price_4_nights))| mean_pred | mean_actual |
|---|---|
| 2.66e+03 | 2.54e+03 |
# In this table we look at the average amount our predictions differ from the actual listing prices
# We then compare it to the average difference to the overall average price to confirm that our model acts as a better predictor of price than the average
listings_test_filtered_2 %>%
summarize(using_model = mean(pred_difference),
using_mean = mean(difference_mean))| using_model | using_mean |
|---|---|
| 118 | 5.26e+03 |
The first table shows that on average, our model slightly underestimates the true mean price of the test data set (diff of 118 USD).
The second table gives us a rough estimate on how our model’s prediction performance as compared to just using the overall mean price of the original data set. while our model slightly overestimates the true values (on average by 118 USD), an prediction model simply using the overall mean price would heavily overstate the prices (on average by 5256 USD)
Plotting
# We visualize the distributions of the actual prices and our predictions
listings_test_filtered_2 %>%
#in order to plot the data via ggplot, we have to put the data in "tidy" format, using the pivot longer function
pivot_longer(names_to = "pred_vs_actual", values_to = "value", cols = c(price_4_nights, pred_price_fit)) %>%
#we can then continue plotting
ggplot(aes(x=value, fill = pred_vs_actual)) +
geom_density(alpha = 0.3)# The second graph investigates how the difference between prediction and actual values is connect to the actual price for 4 nights
ggplot(listings_test_filtered_2, aes(x=price_4_nights, y=pred_difference)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_smooth(se =FALSE, method = lm) +
labs(title = "The estimation errors seem to be correlated with the price", y="Difference between estimate and actual price", x="Actual price") In the first plot we can see the density plot for our estimates and the actual values. The distribution of the actual prices look heavily right skewed, while the model predictions are distributed symetrically.
In the second plot we can see that the prediction error seem to be correlated to the actual price of the listing. The higher the price of the listing, the more our model will underestimate the price.
This could be a sign of a variable currently not included in our model, that would potentially explain this tendency.
Overall however, our final model delivers a good estimate on what how much a couple would have to spend on rent during a 4 day trip to beautiful Cape Town!
Statista. 2020. Global AC Penetration Rate By Country 2016 | Statista. [online] Available at: https://www.statista.com/statistics/911064/worldwide-air-conditioning-penetration-rate-country/ [Accessed 17 September 2020].